40 research outputs found

    Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum

    We have developed an algorithm called Q5 for probabilistic classification of healthy vs. disease whole serum samples using mass spectrometry. The algorithm employs Principal Components Analysis (PCA) followed by Linear Discriminant Analysis (LDA) on whole spectrum Surface-Enhanced Laser Desorption/Ionization Time of Flight (SELDI-TOF) Mass Spectrometry (MS) data, and is demonstrated on four real datasets from complete, complex SELDI spectra of human blood serum. Q5 is a closed-form, exact solution to the problem of classification of complete mass spectra of a complex protein mixture. Q5 employs a novel probabilistic classification algorithm built upon a dimension-reduced linear discriminant analysis. Our solution is computationally efficient; it is non-iterative and computes the optimal linear discriminant using closed-form equations. The optimal discriminant is computed and verified for datasets of complete, complex SELDI spectra of human blood serum. Replicate experiments of different training/testing splits of each dataset are employed to verify robustness of the algorithm. The probabilistic classification method achieves excellent performance. We achieve sensitivity, specificity, and positive predictive values above 97% on three ovarian cancer datasets and one prostate cancer dataset. The Q5 method outperforms previous full-spectrum complex sample spectral classification techniques, and can provide clues as to the molecular identities of differentially-expressed proteins and peptides.
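The three performance figures quoted in this abstract (sensitivity, specificity, positive predictive value) follow the standard confusion-matrix definitions. The sketch below only illustrates those definitions; the function name and the counts are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, ppv) from confusion-matrix counts.

    tp/fn count disease samples called disease/healthy;
    tn/fp count healthy samples called healthy/disease.
    """
    sensitivity = tp / (tp + fn)  # fraction of disease samples detected
    specificity = tn / (tn + fp)  # fraction of healthy samples cleared
    ppv = tp / (tp + fp)          # fraction of disease calls that are correct
    return sensitivity, specificity, ppv


# Hypothetical counts for a single train/test split (illustrative only).
sens, spec, ppv = classification_metrics(tp=49, fp=1, tn=49, fn=1)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} ppv={ppv:.3f}")
```

In a replicate-split setup such as the one the abstract describes, these values would typically be computed per split and then summarized across splits.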

    High-Throughput Inference of Protein-Protein Interaction Sites from Unassigned NMR Data by Analyzing Arrangements Induced By Quadratic Forms on 3-Manifolds

    We cast the problem of identifying protein-protein interfaces, using only unassigned NMR spectra, into a geometric clustering problem. Identifying protein-protein interfaces is critical to understanding inter- and intra-cellular communication, and NMR allows the study of protein interaction in solution. However, NMR studies of a protein complex are often very time-consuming, mainly due to the bottleneck in assigning the chemical shifts, even if the apo structures of the constituent proteins are known. We study whether it is possible, in a high-throughput manner, to identify the interface region of a protein complex using only unassigned chemical shift and residual dipolar coupling (RDC) data. We introduce a geometric optimization problem where we must cluster the cells in an arrangement on the boundary of a 3-manifold. The arrangement is induced by a spherical quadratic form, which in turn is parameterized by SO(3) × R^2. We show that this formalism derives directly from the physics of RDCs. We present an optimal algorithm for this problem that runs in O(n^3 log n) time for an n-residue protein. We then use this clustering algorithm as a subroutine in a practical algorithm for identifying the interface region of a protein complex from unassigned NMR data. We present the results of our algorithm on NMR data for 7 proteins from 5 protein complexes and show that our approach is useful for high-throughput applications in which we seek to rapidly identify the interface region of a protein complex.
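As a rough illustration of the "spherical quadratic form parameterized by SO(3) × R^2", the sketch below assumes the usual RDC model in which the coupling for a unit bond vector v is proportional to v^T S v, with S a symmetric traceless 3×3 alignment tensor. S is built from a rotation and two free eigenvalues. All names are hypothetical, the physical scaling constant is omitted, and only a rotation about the z-axis is used for brevity (a general element of SO(3) needs three angles). This is not the paper's clustering algorithm.

```python
import math


def rotation_z(theta):
    """Rotation matrix about the z-axis (one simple element of SO(3))."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def alignment_tensor(rotation, a, b):
    """Build S = R diag(a, b, -(a + b)) R^T, symmetric and traceless."""
    eig = [a, b, -(a + b)]  # two free parameters; the third fixes trace = 0
    return [[sum(rotation[i][k] * eig[k] * rotation[j][k] for k in range(3))
             for j in range(3)] for i in range(3)]


def quadratic_form(tensor, v):
    """Evaluate v^T S v for a unit vector v (unscaled RDC value)."""
    return sum(v[i] * tensor[i][j] * v[j]
               for i in range(3) for j in range(3))


# Hypothetical tensor: identity rotation, eigenvalue parameters 1.0 and 0.5.
S = alignment_tensor(rotation_z(0.0), a=1.0, b=0.5)
axes = ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])
values = [quadratic_form(S, axis) for axis in axes]
print(values)
```

The level sets of such a form on the unit sphere are the curves that would partition bond-vector orientations into cells; the abstract's arrangement is built from structures of this kind.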

    A Structure-Based Approach for Mapping Adverse Drug Reactions to the Perturbation of Underlying Biological Pathways

    Adverse drug reactions (ADRs), also known as side-effects, are complex undesired physiologic phenomena observed secondary to the administration of pharmaceuticals. Several phenomena underlie the emergence of each ADR; however, a dominant factor is the drug's ability to modulate one or more biological pathways. Understanding the biological processes behind the occurrence of ADRs would lead to the development of safer and more effective drugs. At present, no method exists to discover these ADR-pathway associations. In this paper we introduce a computational framework for identifying a subset of these associations based on the assumption that drugs capable of modulating the same pathway may induce similar ADRs. Our model exploits multiple information resources. First, we utilize a publicly available dataset pairing drugs with their observed ADRs. Second, we identify putative protein targets for each drug using the protein structure database and in-silico virtual docking. Third, we label each protein target with its known involvement in one or more biological pathways. Finally, the relationships among these information sources are mined using multiple stages of logistic regression while controlling for over-fitting and multiple-hypothesis testing. As proof-of-concept, we examined a dataset of 506 ADRs, 730 drugs, and 830 human protein targets. Our method yielded 185 ADR-pathway associations, of which 45 were selected to undergo a manual literature review. We found 32 associations to be supported by the scientific literature.
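The three information sources named in this abstract (drug to ADR, drug to protein target, target to pathway) can be linked to enumerate candidate ADR-pathway pairs. The sketch below shows only that linking step on hypothetical toy data; it counts supporting drugs per pair and does not implement the multi-stage logistic regression or the multiple-hypothesis control that the paper describes.

```python
from collections import defaultdict

# Hypothetical toy inputs (not from the paper's dataset).
drug_adrs = {"drugA": {"nausea", "rash"}, "drugB": {"nausea"}, "drugC": {"rash"}}
drug_targets = {"drugA": {"P1", "P2"}, "drugB": {"P1"}, "drugC": {"P3"}}
target_pathways = {"P1": {"pathwayX"}, "P2": {"pathwayY"}, "P3": {"pathwayY"}}


def candidate_associations(drug_adrs, drug_targets, target_pathways):
    """Map each (ADR, pathway) pair to the number of drugs supporting it."""
    supporters = defaultdict(set)
    for drug, adrs in drug_adrs.items():
        # Pathways reachable from this drug through its putative targets.
        pathways = set()
        for target in drug_targets.get(drug, ()):
            pathways |= target_pathways.get(target, set())
        for adr in adrs:
            for pathway in pathways:
                supporters[(adr, pathway)].add(drug)
    return {pair: len(drugs) for pair, drugs in supporters.items()}


support = candidate_associations(drug_adrs, drug_targets, target_pathways)
for pair, n_drugs in sorted(support.items()):
    print(pair, n_drugs)
```

A statistical model would then be needed to decide which of these co-occurrences are stronger than chance; raw counts alone say nothing about significance.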

    Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum

    We have developed an algorithm called Q5 for probabilistic classification of healthy versus disease whole serum samples using mass spectrometry. The algorithm employs principal components analysis (PCA) followed by linear discriminant analysis (LDA) on whole spectrum surface-enhanced laser desorption/ionization time of flight (SELDI-TOF) mass spectrometry (MS) data and is demonstrated on four real datasets from complete, complex SELDI spectra of human blood serum. Q5 is a closed-form, exact solution to the problem of classification of complete mass spectra of a complex protein mixture. Q5 employs a probabilistic classification algorithm built upon a dimension-reduced linear discriminant analysis. Our solution is computationally efficient; it is noniterative and computes the optimal linear discriminant using closed-form equations. The optimal discriminant is computed and verified for datasets of complete, complex SELDI spectra of human blood serum. Replicate experiments of different training/testing splits of each dataset are employed to verify robustness of the algorithm. The probabilistic classification method achieves excellent performance. We achieve sensitivity, specificity, and positive predictive values above 97% on three ovarian cancer datasets and one prostate cancer dataset. The Q5 method outperforms previous full-spectrum complex sample spectral classification techniques and can provide clues as to the molecular identities of differentially expressed proteins and peptides.

    Identification of Novel Small Molecule Inhibitors of Core-Binding Factor Dimerization by Computational Screening against NMR Molecular Ensembles

    The long development process of novel pharmaceutical compounds begins with the identification of a lead inhibitor compound. Computational screening to identify those ligands, or small molecules, most likely to inhibit a target protein may benefit the pharmaceutical development process by reducing the time required to identify a lead compound. Typically, computational ligand screening utilizes high-resolution structural models of both the protein and ligand to fit or ‘dock’ each member of a ligand database into the binding site of the protein. Ligands are then ranked by the number and quality of interactions formed in the predicted protein-ligand complex. It is currently believed that proteins in solution do not assume a single rigid conformation but instead tend to move through a small region of conformation space [14]. Therefore, docking ligands against a static snapshot of protein structure has predictive limitations because it ignores the inherent flexibility of the protein [14]. A challenge, therefore, has been the development of docking algorithms capable of modeling protein flexibility while balancing computational feasibility. In this paper, we present our initial development and work on a molecular ensemble-based algorithm to model protein flexibility for protein-ligand binding prediction. First, a molecular ensemble is generated from molecular structures satisfying experimentally-measured NMR constraints. Second, traditional protein-ligand docking is performed on each member of the protein’s molecular ensemble. This step generates lists of ligands predicted to bind to each individual member of the ensemble. Finally, lists of top predicted binders are consolidated to identify those ligands predicted to bind multiple members of the protein’s molecular ensemble. We applied our algorithm to identify inhibitors of Core Binding Factor (CBF) among a subset of approximately 70,000 ligands of the Available Chemicals Directory. Our 26 top-predicted binding ligands are currently being tested experimentally in the wetlab by both NMR-bindin
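The final consolidation step in this abstract (keeping ligands predicted to bind multiple ensemble members) can be sketched as a simple vote over per-member top lists. The function name, the `top_k` and `min_members` thresholds, and the ligand identifiers below are all hypothetical; the abstract does not state how the lists were actually cut off or combined.

```python
from collections import Counter


def consensus_binders(ranked_lists, top_k, min_members):
    """Return ligands in the top_k of at least min_members ranked lists.

    ranked_lists holds one best-first ligand ranking per ensemble member.
    """
    votes = Counter()
    for ranking in ranked_lists:
        # Use a set so each ensemble member votes at most once per ligand.
        votes.update(set(ranking[:top_k]))
    return sorted(lig for lig, n in votes.items() if n >= min_members)


# Hypothetical docking rankings for three ensemble members.
ensemble_rankings = [
    ["lig3", "lig7", "lig1", "lig9"],
    ["lig7", "lig2", "lig3", "lig5"],
    ["lig7", "lig9", "lig4", "lig3"],
]
hits = consensus_binders(ensemble_rankings, top_k=2, min_members=2)
print(hits)
```

Raising `min_members` favors ligands that dock well across many conformations, at the cost of discarding ligands that bind only one conformer strongly.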